Recap: files, scripts and plotting functions

Plotting and generating artificial example data for plotting

  • The plot() command is a generic function that (often) automatically adopts to data type.
  • We will be using plot() and a few of its cousins, but their parameters (like color) usually all have the same names.
  • Arguments have names, but names can be omitted if using them in their intended order. You can always check the parameters of a function by executing help(functionname) or searching for the function by name in the Help tab on the right.
plot( c(1,4,10) ) # the basic plot command

# note: the c() command *c*ombines values into a vector; this will be useful later

We will be using some datasets built into R, as well as randomly generated data. It is not difficult to import your own data into R, but this is also a very short workshop. Importing data is covered in the last “bonus” section below if you’re interested.

runif(n = 1) # this command generates n random numbers
## [1] 0.4828785
sample(x = c(1,4,10), size = 1) # this produces a number of samples from the provided vector (x)
## [1] 10

Numerical data

Numerical values include things we can measure on a continuous scale (height, weight, reaction time), things that can be ordered (“rate this on a scale of 1-5”), and things that have been counted (number of participants in an experiment, number of words in a text).

A single numeric variable

We will use a built-in dataset called “iris” - it contains information about a bunch of flowers.

data("iris")  # load the data into the workspace (or "global environment").
# In RStudio, you can have a look at the dataframe by clicking on the little "table" icon next to it in the Environment section (top right).

# We can also inspect the data using R commands.
head(iris)     # prints the first rows
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
summary(iris)  # produces an automatic summary of the columns
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
# Plotting time! Let's see for example how long the petals are in the dataset
iris$Petal.Length       # the $ is used for accessing (named) column of a dataframe
##   [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3
##  [18] 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4
##  [35] 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7
##  [52] 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1
##  [69] 4.5 3.9 4.8 4.0 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5
##  [86] 4.5 4.7 4.4 4.1 4.0 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1
## [103] 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9
## [120] 5.0 5.7 4.9 6.7 4.9 5.7 6.0 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1
## [137] 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9 5.7 5.2 5.0 5.2 5.4 5.1
iris[, "Petal.Length"]  # this is the other indexing notation
##   [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3
##  [18] 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4
##  [35] 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7
##  [52] 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1
##  [69] 4.5 3.9 4.8 4.0 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5
##  [86] 4.5 4.7 4.4 4.1 4.0 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1
## [103] 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9
## [120] 5.0 5.7 4.9 6.7 4.9 5.7 6.0 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1
## [137] 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9 5.7 5.2 5.0 5.2 5.4 5.1
plot(iris$Petal.Length) # two observations: there is quite a bit of variation, and it seems there are clusters in the data

hist(iris$Petal.Length, breaks=25) # a histogram shows the distribution of values (breaks change resolution)

boxplot(iris$Petal.Length)         # a boxplot is like a visual summary()
points(x=rep(1, nrow(iris)), y=iris$Petal.Length, col = rgb(0,0,0, alpha=0.2)) # could also add actual datapoints

# Plot boxplots, grouped by the Species variable:
boxplot(iris$Petal.Length ~ iris$Species) # note the ~ notation
grid() # why not add a grid for reference

# A slightly nicer version:
boxplot(iris$Petal.Length ~ iris$Species, ylab="petal length", border=c("purple", "darkblue", "lightblue"), boxwex=0.5, cex=0.4) 
abline(h=1:7, col=rgb(0,0,0,0.1))  # add vertical lines instead of full grid

A Note on the rgb(red, green, blue, alpha) function: this allows making custom colors; alpha controls transparency. Possible values range between 0 and 1 by default. Below is a piece of code that generates an example of how the color scheme works (don’t worry if you don’t understand the actual code, this is above the level of this workshop; just select the entire block and run it - CTRL+SHIFT+ENTER (or CMD+SHIFT+ENTER on Mac) is a handy shortcut for this).

Inspecting two numeric variables

This short workshop is not intended to cover statistical concepts like correlation, but we will nevertheless quickly look at plotting two variables against each other.

plot(iris$Sepal.Length, iris$Sepal.Width) # no interaction?

# Why not color-code by species. This method also makes use of the [ ] indexing.
iriscolors= c("purple", "darkblue", "lightblue")
plot(iris$Sepal.Length, iris$Sepal.Width, 
     col=iriscolors[iris$Species], pch=20) # pch sets the point type
legend("topleft", pch=20, legend = levels(iris$Species), col=iriscolors, cex=0.7, bty="n") # add legend

# Whoa... This is suddenly a lot of code out of nowhere. If some of it looks overwhelming, worry not, everything will become clear once you get into R a bit more. 

# Add detail to make this legible to the colour-blind, and printable in black-and-white
irispoints = c(15,16,17)  # see help(points) for more
plot(iris$Sepal.Length, iris$Sepal.Width, 
     col = iriscolors[iris$Species] , pch = irispoints[iris$Species])
legend("topleft", pch=irispoints, legend = levels(iris$Species), col=iriscolors, cex=0.7, bty="n")

# Make this publication-ready by adding proper labels
plot(iris$Sepal.Length, iris$Sepal.Width, 
     col=iriscolors[iris$Species] , pch=irispoints[iris$Species],
     main="Iris sepals in three species", xlab="Sepal length", ylab="Sepal width")
grid(col=rgb(0,0,0,0.2))
legend("topleft", pch=irispoints, legend = levels(iris$Species), col=iriscolors, cex=0.7, bty="n")

## Bonus: plotting regression lines
# This workshop does not cover actual statistical techniques, but in case you ever need to plot a regression line, this is pretty simple in R:

# another iris plot
plot(iris$Sepal.Length, iris$Petal.Length, pch=irispoints[iris$Species])
grid(col=rgb(0,0,0,0.2), lty=1)

# do the regression analysis:
linmodel = lm(iris$Petal.Length ~ iris$Sepal.Length)
abline(linmodel, col=rgb(0,0,0,0.3), lwd=4) # abline can handle the output of the lm (linear model) command

# Of course we already recognized that the data is more complex as initially thought (consisting of three distinct groups), so a proper regression model should take that into account.

Time series

While a whole subject on its own, we will have a quick look at plotting time series - data reflecting changes in some variable over time.

# This time we'll generate some random data and pretend it's real data.
# "The following data are reaction times to stimuli of one individual, over 100 trials, in an experiment on..."
retime = c(runif(20, 0,0.1), seq(0.1, 2.8,length.out = 80)) *  runif(100, 0.7, 1.1 )

# Have a look at the raw data first

# Now let's plot it
plot(retime, ylab="reaction time", type="l" ) 

What can you tell by the looks of the data?

Exercise: improve this plot by adding the ‘type’ parameter and setting its value as “l” (which stands for line, which are more useful in this instance), and set the X axis label (xlab) to say “trials” instead of the default “Index”.

## Bonus: plotting multiple (although still artificial) time series.
# First we need to create more data; let's bind it into a matrix as well
retimes = rbind(subj1=retime, subj2=sort(rnorm(100, 1,0.3))*runif(100,0.4,0.8), subj3=sort(rnorm(100, 1,0.3))*runif(100,0.9,1.1) ) # rows are subjects, columns are trials
plot(NA, ylim=c(0,3), xlim=c(0,100), ylab="reaction time", xlab="trials"); grid()
# create an empty plot beforehand, then add a line per subject ( [1,] indexes the first row, etc)
lines(retimes[1, ], col="darkred")
lines(retimes[2, ], col="darkblue")
lines(retimes[3, ], col="darkgreen")

Categorical data

Categorical/nominal/discrete values cannot be put on a continuous scale or ordered, and include things like binary values (student vs non-student) and all sorts of labels (noun, verb, adjective). Words in a text could be viewed as categorical data.

Categories and contingency tables

Here is another artificial dataset. Let’s pretend I went around Edinburgh and asked random people on the street the following question: “A new species of insect was recently discovered in Scotland, and they called it Boubicus Boubasus - or Bouba for short. What’s your intuition, is Bouba a big fat bug, or a small slim bug?” (and the same for Kikis Kikosius, or Kiki for short)

boubakiki = data.frame(meanings=c(sample(c("big", "small"),20,T, prob=c(0.8,0.2)), sample(c("big", "small"),20,T, prob=c(0.3,0.7))), words=c(rep("bouba",20), rep("kiki",20)) ) # this command will create the semi-random data

# Have a look at the raw data first.

# Now let's use the table() function to make sense of it:
bktable = table(boubakiki)
mosaicplot(bktable, col=c("orange", "navy"))                    # a simple mosaic plot

barplot(bktable, ylab="big    small")  # another approach using barplot

## Bonus:
# You *could* also use heatmap(bktable), but a heatmap does not make much sense with just 2x2 categories - it's easier to just look at the actual table. To get an idea how it would look like with larger contingency tables, check out this randomly generated plot: 
heatmap(matrix(runif(20^2), 20,20, dimnames=list(letters[1:20], letters[1:20])), symm=T)

## Bonus 2:
# If you ever find yourself dealing with a correlation matrix (a matrix of correlation values between a bunch of variables; usually via the cor() function in R), then you might be interested in the corrplot package. It is similar to heatmaps, but generates plots with correlation matrices in mind. Do an install.package("corrplot") and try the following
library("corrplot")
## corrplot 0.84 loaded
corrplot(cor(iris[,1:4]) )

Words!

install.packages("wordcloud") # install a "package", a collection of functions to extend R's basic functionality; this needs to be done only once for each package.
# Let's create an object with a bunch of text:
sometext = "In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort. It had a perfectly round door like a porthole, painted green, with a shiny yellow brass knob in the exact middle. The door opened on to a tube-shaped hall like a tunnel: a very comfortable tunnel without smoke, with panelled walls, and floors tiled and carpeted, provided with polished chairs, and lots and lots of pegs for hats and coats—the hobbit was fond of visitors. The tunnel wound on and on, going fairly but not quite straight into the side of the hill—The Hill, as all the people for many miles round called it—and many little round doors opened out of it, first on one side and then on another. No going upstairs for the hobbit: bedrooms, bathrooms, cellars, pantries (lots of these), wardrobes (he had whole rooms devoted to clothes), kitchens, dining-rooms, all were on the same floor, and indeed on the same passage. The best rooms were all on the left-hand side (going in), for these were the only ones to have windows, deep-set round windows looking over his garden, and meadows beyond, sloping down to the river."

# Now let's do some very basic preprocessing to be able to work with the words in the text:
clean = gsub("[[:punct:]]", "", sometext) # remove punctuation (that weird thing inside the gsub (R's find-and-replace command) is a regular expression; don't ask, it just works)
cleanlow = tolower(clean) # make everything lowecase
words = strsplit(cleanlow, split=" ")[[1]]
# Inspect the object we just created. It should be a vector of 232 words.

# Some ways to inspect and visualize textual data
sortedwords = sort(table(words)) # counts the words and sorts them
plot(sortedwords, xaxt="n")
axis(1, 1:length(sortedwords), names(sortedwords), las=2, cex.axis=0.5) # add the words

# Time to use the wordcloud package we installed earlier
library("wordcloud") # load the package (need to be done again if you start R again)
## Loading required package: RColorBrewer
wordfreqs = as.numeric(sortedwords) # get the frequencies from the table object
wordcloud(words = names(sortedwords), freq=wordfreqs, min.freq = 0)

# Note: if R gives you errors (saying word x could not fit), ignore them. Also, if plots look strange after using wordcloud, use the dev.off() command to reset graphics.
## Bonus: further cleaning for nicer clouds
# Ideally we would remove stopwords (the, and...) before plotting things like wordclouds. There are packages that do that (tm, text2vec) in various ways, or you could write some code to do it yourself (using clever math to give greater weight to "context" words, removing stopwords using lists or regular expressions, etc).
# A very simple approach is to just remove all short words:
sortedwords2 = sortedwords[which(nchar(names(sortedwords))>4 ) ] # look up the help files of the commands used here if you'd like to understand how this works exactly
wordcloud(names(sortedwords2), as.numeric(sortedwords2), min.freq = 0, col=terrain.colors(10))

# terrains.colors() is a "palette", a function that generates a number of colors from a predefined set

The end.

That’s it for today!

Do play around with these things later when you have time, and look into the bonus sections for more cool stuff. If you get stuck, Google is your friend; also, check out www.stackoverflow.com - this site is a goldmine of programming (including R) questions and solutions. If you get into making more plots, you might also want to look into the ggplot2 package, which offers an alternative way of making plots, which some people find more intuitive than the basic R way.

Also, if you get around analysing your own data and need help in terms of presenting and writing about the results of your analyses, choosing the right graph type, what to plot, and how to plot the things you want to plot - feel free to book a session with me through the PPLS Writing Centre.

But wait! There’s one more thing to do (note that unfortunately this will not work if you have errors in your code - usually marked by little red x signs on the left side vertical bar). Since this is an R Markdown document, we can “knit” it into a nice HTML (or PDF, or Word) report file - it will show both the code and the plots produced by the code. To do that, click the “Knit” button (with the little blue ball of yarn) above the script window. If the code is without errors, an HTML document will appear.


Bonus: things to try out at home.

Here are some more things you can try out at home later.

Small note: if you try knitting the RMarkdown file again later and would like to see output from the bonus sections, set eval=TRUE in these blocks, which will allow them to be rendered (all bonus blocks currently have the eval parameter set as FALSE). Do not set eval to TRUE in the small blocks with the install.packages() commands though. You might have also noticed the echo=F parameter - this just means the code itself will not be rendered in the knit output.

Plotting social networks

# Feel free to set eval=T to render the blocks herein when knitting to html.
library("igraph")  # load the package; this needs to be done once after starting R/RStudio
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
# Create an object with some Scottish people (could be a sample in a sociolinguistic study):
scots=c("Angus","Archibald","Baldwin","Boyd","Cinead","Craig","Diarmid","Donald","Duncan","Eachann","Erskine","Ethan","Fergus","Fingal","Fraser","Hamilton","Iain","Irvine","James","Muir","Mungo","Owen","Raibert", "Lyall", "Margaret", "Mairi", "Morag", "Murdina","Rhona", "Sorcha", "Thomasina","Una") 
nscots = length(scots) # record the number of people in an object
mates = matrix(sample(0:1,nscots^2,T,prob=c(0.95,0.05)), ncol=nscots, nrow=nscots, dimnames=list(scots, scots)) # this creates a randomized matrix signifying friendships; no need to think about this too hard for now
mates[1:10,1:10] # but have a look at it anyway; '1' means these two people know each other; this line prints the first 10 rows and 10 columns
##           Angus Archibald Baldwin Boyd Cinead Craig Diarmid Donald Duncan
## Angus         0         0       1    0      0     0       0      0      0
## Archibald     0         0       0    0      0     0       0      0      0
## Baldwin       0         0       0    0      0     0       0      0      0
## Boyd          0         0       0    0      0     0       0      0      0
## Cinead        0         0       0    0      0     0       0      0      0
## Craig         0         0       0    0      0     0       0      0      0
## Diarmid       0         0       0    0      0     0       0      0      1
## Donald        0         0       0    0      0     0       0      1      0
## Duncan        0         0       0    0      0     0       0      0      0
## Eachann       0         0       0    0      0     0       0      0      0
##           Eachann
## Angus           0
## Archibald       1
## Baldwin         0
## Boyd            0
## Cinead          0
## Craig           0
## Diarmid         0
## Donald          0
## Duncan          0
## Eachann         0
g = graph_from_adjacency_matrix(mates, mode = "undirected", diag=FALSE) # creates a graph object; this need to have igraph loaded to work

plot(g)  # works, but not very nice looking

# Let's modify the plotting parameters, and define the gender values in the sample
gender = c(rep("m", nscots-9), rep("f", 9)) # create a vector of gender labels for the names
plot(g, vertex.size=4, vertex.color="lightgray", vertex.frame.color=NA, vertex.label.cex=0.9, vertex.label.color=ifelse(gender=="m", yes="blue",no="tomato"), vertex.label.dist=0.1, vertex.label.font=2, edge.color=rgb(0,0,0,0.3))

X marks the spot: making maps in R

Making maps programmatically based on data would come in handy if your worked with demographic data, or dialects, areal sociolinguistics, etc.

library("raster")
## Loading required package: sp
uk = getData("GADM", country = "United Kingdom", level = 2) # download UK map (needs the raster package to be loaded)
par(mar=c(0,0,0,0)) # change plot area for better visibility - mar defines the margins, which are usually used for the axes and labels (reset using dev.off() later)
plot(uk, lwd=0.4) # plot the UK
grid()            # add grid
points(-3.1833, 55.9533, pch=20, col="blue", cex=2) # your current location :)

plot(uk, lwd=0.4, xlim=c(-4,-1), ylim=c(55.7,55.8)); points(-3.1833, 55.9533, pch=20, col="blue", cex=2) # zoom in

Interactive plots in 3D!

A quick look at a package that creates interactive (clickable, rotatable, etc) plots, both in 2D and 3D - these also work in web pages (like the html file you could create by knitting this script file; R Markdown can also be used to create slides, meaning you could easily include interactive graphs in your next presentation).

library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:raster':
## 
##     select
## The following object is masked from 'package:igraph':
## 
##     groups
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# another look at the iris data:
plot_ly(x= iris$Petal.Length,y=iris$Sepal.Width, z=iris$Sepal.Length, type="scatter3d",mode="markers",color=iris$Species) %>% layout(scene = list(yaxis=list(title="Sepal width"),xaxis=list(title="Petal length"), zaxis=list(title="Sepal.Length")))
# Remember the rgb color plots with the black background from earlier? Given there are 3 colors, they really would make more sense in 3D, no?
red=runif(1000);green=runif(1000);blue=runif(1000)
plot_ly(x=red,y=green, z=blue, type="scatter3d",mode="markers", color=I(apply(cbind(red, green, blue,0.5),1, function(x) rgb(x[1],x[2], x[3])))) %>% layout(paper_bgcolor='black', scene = list(xaxis=list(title="red"),yaxis=list(title="green"), zaxis=list(title="blue")))

Last but not least - getting your own data into R and getting plots out of R.

Once you get around to working with your own data, you’ll need to import it into R to be able to make plots based on it. There are a number of ways of doing that.

Table (csv, Excel, txt) into R, import from file

This is probably the most common use case. If your data is in an Excel file formal (.xls, .xlsx), you are better off saving it as a plain text file (although there are packages to import directly from these formats, as well as from SPSS .sav files). The commands for that are read.table(), read.csv() and read.delim(). They basically all do the same thing, but differ in their default settings.

# an example use case with parameters explained
mydata = read.table(file="path/to/my/file.txt", # full file path as a string
                    header=T,      # if the first row contains column names
                    row.names=1,   # if the 1st (or other) column is row names
                    sep="\t",      # what character separates columns in the text file*
                    quote="",      # if quotes in text in any columns, set this to ""
                    )
# * "\t" is for tab (default if you save a text file from Excel), "," for csv, " " if space-spearated, etc
# for more and to check the defaults, see help(read.table)
# the path can be just a file name, if the file is in the working (R's "default") directory; use getwd() to check where that is, and setwd(full/path/to/folder) to set it (or you can use RStudio's Files tab, click on More)

Importing from clipboard

There is a simple way to import data from the clipboard. While importing from files is generally a better idea (you can always re-run the code and it will find the data itself), sometimes this is handy, like quickly grabbing a little piece of table from Excel. It differs between OSes:

mydata = read.table(file = "clipboard")       # in Windows (add parameters as necessary)
mydata = read.table(file = pipe("pbpaste"))   # on a Mac

Importing text

For text, the readLines() command usually works well (its output is a character vector, so if the text file has 10 lines, then readLines produces a vector of length 10, where each line is an element in that vector (you could use strsplit() to further split it into words. If the text is organized neatly in columns, however, you might still consider read.table(), but probably with the stringsAsFactors=F parameter (this avoids making long text strings into factors, read up on it if needed).

Exporting plots

RStudio has handy options to export plots - click on Export on top of the plot panel, and choose the output format. Plots can be exported using R code as well - this is in fact a better approach, since otherwise you would have to click through the Export… menus again every time you change your plot and need to re-export. Look into the help files of the jpeg and pdf functions to see how this works.

Anything else

There are also packages to import (and manipulate) images, GIS map data (shapefiles), data from all sorts of other file formats (like XML, HTML) and many more. Just google a bit and you’ll find what you need.